PREVENTION OF DEADLOCKS AND LIVELOCKS IN LOSSLESS, 
BACKPRESSURED PACKET NETWORKS 

CROSS REFERENCE TO RELATED APPLICATION 

This application claims priority of Provisional Application Serial No. 60/159147 
which was filed on October 13, 1999. 

FIELD OF THE INVENTION 

The present invention relates generally to packet telecommunications networks, and 
in particular, to control of interconnected nodes in such packet networks in which 
deadlocks, livelocks and buffer overflow (packet loss) are avoided by providing a selective 
backpressure or feedback signal from a receiving node to a sending node having packets to 
send to the receiving node. 

BACKGROUND OF THE INVENTION 

Congestion occurs in packet networks when the demand exceeds the availability of 
network resources, leading to lower throughputs and higher delays. If congestion is not 
properly controlled, some packets will not get to their destinations, or will be unduly 
delayed. As a result, various applications requiring the information contained in the packets 
transported by the network may not meet their quality-of-service (QoS) requirements. 

Proper congestion control is an especially important topic in emerging local area 
networks (LANs): large LANs with a heterogeneous mix of link speeds (ranging, for 
example, from 10 Mbps up to 1 Gbps) and the need to support the quality-of-service (QoS) 
requirements of multimedia applications from hundreds or even thousands of users. 

When congestion builds up in a packet network, two general approaches are 
possible to cope with the shortage of buffer space. One approach is to drop incoming 
packets for which buffer space is not available and to rely on the end-to-end protocols for 
the recovery of lost packets. Dropped packets are later retransmitted by end-to-end protocols. 
In many situations, dropping packets is wasteful, since the dropped packet has already 
traversed a portion of the network. Retransmissions of dropped packets may also lead to 
unnecessarily large end-to-end delays. 

An alternative approach is to insist that no packets should be dropped inside a 
packet network, even when congestion builds up. One way to accomplish this goal is to 
have the congested nodes send backpressure feedback to neighboring nodes, informing 
them of unavailability of buffering capacity and in effect stopping them from forwarding 
packets until enough buffer becomes available. While backpressure can be a useful 
approach, this method of dealing with congestion can potentially lead to deadlocks and 
livelocks in the network. 

A deadlock is a condition under which the throughput of the network, or part of the 
network, goes to zero due to congestion (i.e., no packets are transmitted). This can be 
explained by reference to Fig. 1, which shows three interconnected nodes A, B and C. 
(Note that A, B and C can be any three adjacent nodes that form a "cycle" in a larger 
network of nodes.) Assume node A's buffer is full with packets destined for node B and 
beyond. Accordingly, node A sends a "stop" signal to its upstream nodes, in this example, 
node C. Likewise, if node B's buffer is full with packets destined for node C and beyond, 
and node C's buffer is full with packets destined for node A and beyond, both nodes B and C 
also send "stop" signals to their respective upstream nodes, in this example, nodes A and B. 
Under these circumstances, there is a complete stoppage because each node in the loop has 
been directed to "stop" transmission. Stated another way, each network process, having 
resources required by the others, refuses or neglects to release them, causing the other 
processes to wait indefinitely. 
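
The circular wait described above can be illustrated with a minimal Python sketch. It is a toy model, not part of the described arrangement: the function name, the dictionaries, and the simplification that a node may forward only when its next hop has free buffer space are all assumptions made for illustration.

```python
# Toy model (assumption): every node holds packets only for the next node in
# the cycle, and a node with a full buffer has told its upstream node to stop.
def can_any_node_send(next_hop, buffer_full):
    """Return True if at least one node is still allowed to forward a packet.

    next_hop[n]    -- the node that n's buffered packets must go to next
    buffer_full[n] -- True when n has no room and has sent "stop" upstream
    """
    return any(not buffer_full[next_hop[n]] for n in next_hop)


# The three-node cycle of Fig. 1: A forwards to B, B to C, C to A.
cycle = {"A": "B", "B": "C", "C": "A"}

all_full = {"A": True, "B": True, "C": True}
print(can_any_node_send(cycle, all_full))   # False: every node is stopped

one_free = {"A": True, "B": True, "C": False}
print(can_any_node_send(cycle, one_free))   # True: B may forward to C
```

With all three buffers full, no node in the cycle is permitted to transmit, which is the zero-throughput condition described for Fig. 1.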

In a livelock situation, the network is not stopped, but one or more individual 
packets are never transmitted. Fig. 2 shows a simple example of a livelock that can occur 
when a node 201 with separate high and low priority buffers 202 and 203, respectively, uses 
a strict-priority output link scheduling algorithm: low-priority packets in buffer 203 are 
forced to wait indefinitely in buffer 203 as new higher-priority packets continue to arrive at 
buffer 202 in node 201. These higher-priority packets get serviced, to the exclusion of 
packets in buffer 203. Even if all the link scheduling algorithms in a network are 
well-behaved (i.e., they don't indefinitely neglect the transmissions of any particular packets), 
livelocks can still occur as a result of "network-level" effects. For instance, a 
hop-count-based flow control protocol might cause some node to (indefinitely) block its 
transmission of packets that have traveled fewer than a specified number of hops. From the 
foregoing, it is seen that livelocks are an extreme example of "unfairness," in which packets 
are "stuck" as they wait indefinitely in a queue while other packets (including new arrivals) 
are continually served. 
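
The strict-priority livelock of Fig. 2 can be reproduced with a short Python sketch. The function name, the one-packet-per-slot service model, and the arrival pattern are illustrative assumptions, not details taken from the figure.

```python
from collections import deque


def strict_priority_run(slots, high_arrivals_per_slot):
    """Serve one packet per slot, always preferring the high-priority queue.

    A single low-priority packet ("low-0") is waiting from the start.
    Returns the list of packets served, in order.
    """
    high, low = deque(), deque(["low-0"])
    served = []
    for t in range(slots):
        # New high-priority packets arrive at the start of each slot.
        for k in range(high_arrivals_per_slot):
            high.append(f"high-{t}-{k}")
        # Strict priority: the low queue is served only if high is empty.
        queue = high if high else low
        if queue:
            served.append(queue.popleft())
    return served


busy = strict_priority_run(slots=1000, high_arrivals_per_slot=1)
print("low-0" in busy)   # False: the low-priority packet is never served

idle = strict_priority_run(slots=3, high_arrivals_per_slot=0)
print("low-0" in idle)   # True: with no high-priority arrivals it is served
```

The link stays fully utilized in the first run, yet the low-priority packet waits for as long as high-priority arrivals continue.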

Many different types of deadlocks have been identified and studied, along with 
methods of preventing them. (See, e.g., "System Deadlocks" by E. G. Coffman et al., 
Computing Surveys, Vol. 3, pp. 67-78, June 1971; "Some Deadlock Properties of 
Computer Systems" by R. C. Holt, Computing Surveys, Vol. 4, pp. 179-196, September 
1972; "Deadlock Avoidance in Store-and-Forward Networks - I: Store-and-Forward 
Deadlock" by P. M. Merlin et al., IEEE Trans. Commun., Vol. COM-28, pp. 345-354, 
March 1980; "Prevention of Deadlocks in Packet-Switched Data Transport Systems", by 
K. D. Gunther, IEEE Trans. Commun., Vol. COM-29, pp. 512-524, April 1981; "Prevention 
of Store-and-Forward Deadlock in Computer Networks", by I. S. Gopal, IEEE Trans. 
Commun., Vol. COM-33, pp. 1258-1264, December 1985; Design and Validation of 
Computer Protocols by G. J. Holzmann, Englewood Cliffs, NJ: Prentice Hall, 1991.) 

In the LAN context, recent simulation results show that hop-by-hop backpressure 
can be better than TCP for dealing with short-lived congestion. (See "Selective 
Back-Pressure in Switched Ethernet LANs", by W. Noureddine et al., IEEE GLOBECOM'99 
Symposium on High Speed Networks, Dec. 1999.) TCP, the dominant transport protocol in 
the Internet, uses packet drops as an indication of congestion and requires sufficient levels 
of loss in order to be an effective control. In a sense, TCP keeps increasing the load in 
order to increase the loss so that the necessary feedback signals are sent. Hop-by-hop 
backpressure helps reduce the number of packets dropped during periods of transient 
congestion, and avoids wasting the network resources that a packet has already consumed 
up to that point. This is particularly important when we consider packets that might arrive 
at a LAN after traversing a wide area network (WAN). The penalty is that some links are 
"turned off" for short periods of time, perhaps negatively impacting other flows (i.e., those 
not involved in the "congestion") as the backpressure propagates through the network. The 
advantage is that the distributed memory resources in a network can be used to buffer 
excess traffic generated by bursty sources. This helps bypass the costly TCP flow control 
mechanism, which may have a negative performance impact in the LAN context. (See 
"Flow Control in ATM networks: a survey" by S. Kamolphiwong et al., Comp. Commun., 
Vol. 21, pp. 951-968, 1998.) 

Although a backpressure congestion control mechanism, in which a node receiving a 
packet controls the stop-start behavior of a node intending to transmit a packet to the 
receiving node, can, in theory, eliminate packet loss in networks, this capability is obviously 
gone if packets need to be dropped to prevent deadlocks or to recover from a deadlock 
condition. Proper strategies for dealing with deadlocks increase in importance as data 
transmission rates increase to gigabit-per-second (and higher) rates. For a given network 
load level, the number of potential deadlocks per hour that have to be prevented, avoided, 
or recovered from, increases in proportion to the transmission rate. In addition, potential 
deadlocks will occur more frequently as the network loading increases. So, for instance, as 
gigabit Ethernet links are extended to cover greater distances in metropolitan-area networks 
(MANs) and WANs, heavier loading of long-distance links to make efficient use of the links 
will cause potential deadlocks to occur more frequently. Currently, proprietary hardware 
and high-quality fiber-optic lines can extend the distances between switches to greater than 
70 km, permitting gigabit Ethernet implementations across MANs and even WANs. (See 
"Gigabit Ethernet ventures into the land beyond the LAN" by J. Caruso, Network World, p. 
36, May 1999; and "Intelligent DWDM takes Gigabit Ethernet to the MAN" by N. 
Margalit, Lightwave, p. 101, June 1999.) 

One simple way to deal with deadlocks is to just start dropping packets once a 
deadlock has occurred, or is "about to occur" (i.e., as congestion increases). For certain 
types of packets (for example, "real-time" traffic and traffic that can permit packet loss), 
this approach works fine. However, packets that must eventually reach their destinations 
will need to be retransmitted if they are dropped to avoid a deadlock or recover from a 
deadlock condition. The impact on total end-to-end delay, including interactions with 
higher-layer protocols (such as TCP), needs to be considered. Also, retransmission of 
packets might even cause the "deadlock condition" to redevelop. 

Another way to deal with deadlocks is to simply increase the amount of available 
bandwidth and/or buffers. The obvious theory here is that if peak amounts of bandwidth 
and buffers are available and are dedicated for all network traffic flows, then buffers do not 
overflow and deadlocks are not created. This approach, however, is typically too wasteful 
of network resources and requires stringent admission control procedures. 

Yet another way to deal with deadlocks is to use up/down routing on a spanning 
tree (see "Autonet: A High-Speed, Self Configuring Local Area Network Using 
Point-to-Point Links" by M. D. Schroeder et al., IEEE Jour. Selected Areas Commun., Vol. 9, pp. 
1318-1335, October 1991), thereby assigning directions to the links, and avoiding cycles. 
However, the selected paths (to create the spanning tree) are generally not the shortest and 
links near the root of the spanning tree become bottlenecks, limiting the network 
throughput. 

Another way to avoid cycles in the network (which can lead to deadlocks) is to split 
each physical link into a number of virtual channels, each with its own queue and stop-start 
backpressure protocol. (See "Deadlock-Free Message Routing in Multiprocessor 
Interconnection Networks" by W. J. Dally et al., IEEE Trans. Comput., Vol. C-36, pp. 547- 
553, May 1987; and "Congestion Control in Asynchronous, High Speed Wormhole Routing 
Networks" by E. Leonardi et al., IEEE Commun. Mag., pp. 58-69, November 1996.) 
Bandwidth is shared among the virtual channels. Layers of acyclic virtual networks and 
deadlock-free routes are created using the virtual channels. However, as the number of 
virtual channels increases, the scheduling becomes more complicated. Also, a method of 
associating individual packets with particular virtual channels is required. 

Finally, more sophisticated buffer allocation strategies (structured buffer pools) can 
be used to prevent deadlocks. Extra buffers are allocated for packets that have higher 
"priority" because they have traveled greater distances (e.g., number of hops) in the 
network (or are closer to their destinations). In other words, using "distance" information 
in the packet headers, buffer space is reserved at each network node according to the 
distance traveled through the network from a source (i.e., the number of hops). This 
technique is again complicated and expensive to implement. 

There are several problems with these prior deadlock prevention techniques. First, 
and most important for the gigabit Ethernet scenario, they may not be compatible with the 
IEEE 802.3z standard. Most prior approaches require packet headers to include the 
"distance information" (such as, for example, the packet's hop count). There is no such 
provision for including distance information in the header of IEEE 802.3z packets. 
Alternatively, some other ("non-standard") way would need to be employed for transferring 
the "distance" information downstream to the next node. 

A second problem with prior approaches is that although deadlocks are prevented, 
the end-to-end packet sequence may not be preserved. This may cause problems for some 
sessions that expect to receive packets in sequence, and perhaps can deal with "missing" 
(i.e., dropped) packets better than "out-of-sequence" packets. For instance, it is possible 
that a session's packets might be transmitted out-of-sequence to the next node because they 
were stored in different buffer classes (even though they all arrive with the same hop 
count). Specific information would need to be kept regarding the order in which packets 
were stored in the various buffer classes. 

A third problem with prior approaches is that although they resolve the deadlock 
problem, some do not eliminate the possibility of livelock. For example, if the scheduling 
algorithm and signaling protocol are not carefully designed, then in the continual presence 
of new arrivals with higher hop counts, packets in lower buffer classes might never have a 
chance to be transmitted. 

A fourth problem with prior approaches is that network nodes need some type of 
"signaling message" to tell upstream neighbors which packets they can transmit 
(alternatively, they need a way to send negative acknowledgements when packets are 
dropped). Provision for such a signaling message does not exist in the IEEE 802.3z 
standard. 

Finally, a fifth problem with some prior approaches is that a method of determining 
the current "distance to destination" information must be available in the network. This 
becomes particularly challenging when the possibility of network reconfigurations and 
routing table updates, which can change the source-to-destination distances, must be taken 
into account. 

SUMMARY OF THE INVENTION 

The present invention is used in a packet communication network comprised 
of interconnected nodes arranged to transmit packets of variable length to adjacent nodes, 
where D is the maximum number of nodes that a packet must traverse through said network 
(i.e., the maximum number of hops) from an originating source to an ultimate destination. 
Each node in the network is, at the same time, both a transmitting node and a receiving 
node for its respective output and input links. For purposes of explanation of the present 
invention, the characteristics and arrangement of each node are best described by referring to 
an exemplary pair of nodes as a sending node $X_\ell$ connected to a receiving node $R_\ell$ via a 
link $\ell$. Each node $X_\ell$ and $R_\ell$ includes a buffer for storing packets en route from the 
originating source node to the ultimate destination node, and management 
hardware/software capability to (a) assign a local priority level $\lambda_p$ (from amongst at least 
two possible priority levels) at the node to packets stored in the buffer, (b) formulate a 
feedback value $f_\ell$ sent from the receiving node $R_\ell$ to the sending node $X_\ell$, that assures 
that there will be room in the buffer in the receiving node $R_\ell$ to store packets subsequently 
received from the upstream node $X_\ell$, and (c) transmit from the sending node $X_\ell$ to the 
receiving node $R_\ell$ only those packets in the buffer in the sending node $X_\ell$ that are eligible 
for transmission as a result of the fact that the packets have a priority level $\lambda_p$ at $X_\ell$ that 
equals or exceeds the feedback value $f_\ell$ received from the receiving node $R_\ell$. 

The priority level $\lambda_p$ assigned to packets stored in the buffer at $R_\ell$ is based upon the 
ultimate destination to which the packets are to be transmitted, such that all packets 
intended for the same ultimate destination have the same priority level. Therefore, we 
represent the priority level associated with a particular destination d as the destination level 
$\lambda_d$. Initially, the priority level $\lambda_d$ is set to 0 for all destinations. When a packet p with 
ultimate destination d arrives at $R_\ell$ from another network node ($X_\ell$) over some link $\ell$, the 
priority level $\lambda_d$ at $R_\ell$ associated with d is updated as the maximum of 

(a) the prior value of $\lambda_d$ at $R_\ell$, or 

(b) $1 + f_\ell$, 

where $f_\ell$ is the value of the most recent transmit feedback sent over the reverse link 
$\ell'$ from node $R_\ell$ to node $X_\ell$. Note that when a packet p with destination d enters the 
network at node n (over some network access link) the destination level $\lambda_d$ at node n does 
not change. Note also that when the priority level $\lambda_d$ at $R_\ell$ is increased to $1 + f_\ell$, the 
priority levels $\lambda_p$ of all packets with ultimate destination d are automatically increased to 
$1 + f_\ell$, which is the new value of $\lambda_d$ at $R_\ell$. 

In one embodiment of the present invention, the priority level assigning step is 
accomplished by assigning a priority level $\lambda_p$ at $R_\ell$ that is less than or equal to D minus the 
number of hops remaining between the receiving node and the ultimate destination. 

The feedback value $f_\ell$ sent from a receiving node $R_\ell$ to a sending node $X_\ell$ is 
determined by first setting in the buffer at the receiving node $R_\ell$ thresholds $B_i$ that limit the 
maximum amount of space for packets with priority levels $\lambda_d$ less than or equal to $i$. At all 
times, all $B_i$ buffer threshold constraints must be satisfied. This division is not a physical 
partitioning of the buffer space, but is only an allocation of space. Allocation typically 
occurs when the system is initialized. 

The receiving node $R_\ell$ thereafter monitors the priority levels $\lambda_d$ of arriving and 
departing packets, and the increasing of priority levels $\lambda_p$ of previously-stored packets, and 
thus keeps track of the total space in the buffer at $R_\ell$ occupied by packets of various 
priority levels $\lambda_d$. The feedback $f_\ell$ sent from the receiving node $R_\ell$ to the sending node $X_\ell$ 
represents the lowest priority level of packets that the receiving node $R_\ell$ could accept 
without violating any of the $B_i$ buffer threshold constraints. In other words, the receiving 
node $R_\ell$ has room to accept packets of priority level $(1 + f_\ell)$ or greater, without violating 
any of the buffer threshold constraints, but the receiving node $R_\ell$ cannot accept packets of 
priority level $f_\ell$ or lower because it could possibly cause one or more of the buffer threshold 
constraints to be violated. 
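
One way to picture how a feedback value could be derived from the thresholds is the following Python sketch. It is an assumption-laden illustration, not the procedure of the invention: the function name, the use of a single worst-case packet size `max_packet`, and the rule of walking down from level D while one more packet still fits are hypothetical choices consistent with the description above.

```python
def transmit_feedback(occupancy, thresholds, max_packet):
    """Return the smallest f such that a packet stored at level 1+f or higher
    would not violate any threshold (illustrative rule, see assumptions).

    occupancy[j]  -- space currently used by packets of level j (j = 0..D)
    thresholds[i] -- B_i, the limit on total space used by levels 0..i
    max_packet    -- assumed largest packet that may arrive
    """
    D = len(thresholds) - 1
    # used[i] is the space held by packets whose level is <= i.
    used, total = [], 0
    for j in range(D + 1):
        total += occupancy[j]
        used.append(total)
    f = D  # most restrictive value: nothing further can be accepted
    # A packet stored at level k counts against every B_i with i >= k, so
    # check the constraints from the top level downward.
    for level in range(D, 0, -1):
        if used[level] + max_packet > thresholds[level]:
            break
        f = level - 1
    return f


B = [2, 4, 6, 8]                                # hypothetical thresholds, D = 3
print(transmit_feedback([0, 0, 0, 0], B, 1))    # 0: empty buffer, no restriction
print(transmit_feedback([1, 3, 1, 0], B, 1))    # 1: only levels 2 and above fit
print(transmit_feedback([0, 0, 0, 8], B, 1))    # 3: buffer is full
```

In this sketch the feedback rises as the lower-level allocations fill, so the sender is progressively limited to packets with higher levels.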

The present invention is a lossless method of preventing deadlocks and livelocks in 
backpressured packet networks. In contrast with prior approaches, the present invention 
does not introduce any packet losses, does not corrupt packet sequence, and does not 
require any changes to packet headers, such as attaching a hop counter, as required by the 
prior approaches. Because the proposed technique makes use of only the Destination 
Address in each packet header, and because the format of a typical gigabit-Ethernet packet, 
for example, contains the Destination Address but not a hop-counter field, the present 
invention can advantageously be used not only in a general packet network, but also in 
gigabit Ethernet (IEEE 802.3z) networks. In such networks, a PAUSE frame stops the 
flow of data frames for specified periods of time (indicated by a parameter in the PAUSE 
frame). A PAUSE frame that contains a parameter of zero time allows the flow of data 
frames to restart immediately. 

BRIEF DESCRIPTION OF THE DRAWING 

The present invention will be more fully appreciated from a consideration of the 
following Detailed Description, which should be read in light of the accompanying drawing 
in which: 

Fig. 1 is a diagram illustrating three interconnected nodes that are involved in a 
deadlock condition; 

Fig. 2 is a diagram illustrating a node having high and low priority buffers, in which 
a livelock condition may occur; 

Fig. 3 is a diagram illustrating two interconnected nodes, and the forward data link 
and feedback control link connecting them; 

Fig. 4 shows a network of three serially interconnected nodes, and, in accordance 
with the present invention, the "Level Table" associated with those nodes, showing the 
"level" assigned to packets buffered in those nodes based upon the destinations of the 
packets; 

Figs. 5a and 5b show two instances of the same sample 10-node network, with 
specific routing paths to reach destinations $D_1$ and $D_2$, respectively; 

Fig. 6 is a diagram illustrating the crosspoint switch portion of a node arranged in 
accordance with the present invention, showing the various virtual queues within the switch; 

Fig. 7 is a diagram illustrating the segmentation of the receive buffers within the 
switch of Fig. 6; 

Fig. 8 is a diagram illustrating the buffer management parameters assigned, in 
accordance with the present invention, to the buffer of Fig. 7; 

Fig. 9 is a flow chart illustrating the send functions performed at a sending node 
$X_\ell$; and 

Fig. 10 is a flow chart illustrating the receive functions performed at a receiving 
node $R_\ell$. 

DETAILED DESCRIPTION 

Referring first to Fig. 3, consider a packet network that includes two nodes 301 and 
302 connected by a one-way communication link $\ell$. The sending node 301 is designated by 
$X_\ell$, the receiving node 302 is designated by $R_\ell$, and the communication link going in the 
reverse direction, from $R_\ell$ to $X_\ell$, by $\ell'$. Let $S_\ell$ denote the scheduling algorithm of link 
$\ell$, i.e., the algorithm employed at node $X_\ell$ to select the next packet from those buffered at 
$X_\ell$ for transmission over $\ell$. As in prior art arrangements, the scheduling algorithm $S_\ell$ 
could possibly base its selection on a number of factors, including packets' order of arrival, 
service priorities, service deadlines, and fairness considerations. The present invention 
enhances any prior art scheduling algorithm $S_\ell$ and avoids packet losses in the network by 
employing a selective backpressure congestion control mechanism for each link $\ell$, to 
control the eligibility of packets to be transmitted over link $\ell$. In this arrangement, before 
the buffer at the receiving node $R_\ell$ of a link $\ell$ overflows, a stop feedback is sent to the 
sending node $X_\ell$, over the reverse link $\ell'$. Unlike prior art backpressure mechanisms, the 
present invention is arranged to avoid the occurrence of deadlocks and livelocks in the 
network. 

Under normal, uncongested conditions, all packets waiting at a sending node $X_\ell$ to be 
sent over link $\ell$ to $R_\ell$ are eligible for transmission and may be selected by the link 
scheduling algorithm $S_\ell$. However, as the buffer at the receiving node $R_\ell$ gets congested, 
e.g., fills up close to its capacity, the set of packets that are eligible for transmission (i.e., 
those packets that may be selected by the scheduling algorithm $S_\ell$ to be sent over link $\ell$) is 
gradually restricted. This is accomplished by sending a Transmit Feedback parameter $f_\ell$, to 
be discussed shortly, over the communication link $\ell'$ in the reverse direction, from $R_\ell$ to 
$X_\ell$. As the congestion at node $R_\ell$ subsides, the restriction placed on the transmission of 
packets over $\ell$ is gradually relaxed by designating more packets as eligible using a new 
value of Transmit Feedback $f_\ell$. 

Let D denote the maximum number of hops in any legitimate network route (or an 
upper bound on the number of hops if the maximum is, a priori, unknown). D depends on 
the network topology and routing protocol. For instance, if shortest path routing is 
used, then D is the diameter of the network. We associate with each packet p buffered at a 
node an integer value between 0 and D, inclusive; we call that assigned value the level of p 
and denote it as $\lambda_p$. When a packet is forwarded in the network from one node to another, 
no information about the level that was assigned to the packet in the first (sending) node is 
carried along with it. For instance, packet levels are not included in the packet headers 
(in contrast with, for example, prior deadlock-prevention schemes that carry hop counts in 
packet headers). To carry such information would require a change in existing Ethernet 
standards. Nonetheless, as will be discussed below, the present invention allows the 
receiving node to infer some partial information about the level that the packet 
held in the previous node. Typically, the level assigned to the packet in the new (receiving) 
node is different from its previous level. 

Like packet levels, the Transmit Feedback $f_\ell$ also assumes an integer value between 
0 and D, inclusive. The eligibility of packets for transmission over each link $\ell$ is 
determined by the value of the corresponding Transmit Feedback $f_\ell$ in accordance with the 
following rule: 

Transmit Eligibility Rule: 

A packet p waiting at node $X_\ell$ is eligible to be picked up by the scheduler of link $\ell$ 
for transmission over $\ell$ if its current level $\lambda_p$ satisfies 

$$\lambda_p \ge f_\ell, \qquad (1)$$

where $f_\ell$ is the most recent Transmit Feedback received by $X_\ell$ from the receiving node 
$R_\ell$. It follows from this rule that when $f_\ell = 0$, all packets are eligible for transmission over 
$\ell$. As $f_\ell$ increases, the protocol becomes gradually more restrictive and fewer and fewer 
packets are considered eligible for transmission over $\ell$. 
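
The Transmit Eligibility Rule amounts to a simple filter applied before the link scheduler makes its choice. The Python sketch below shows this; the function name and the representation of a buffered packet as an (identifier, level) pair are assumptions made for illustration.

```python
def eligible_packets(buffered, feedback):
    """Apply rule (1): keep packets whose level is at least the feedback value.

    buffered -- list of (packet_id, level) pairs in arrival order
    feedback -- most recent Transmit Feedback received from the next node
    """
    return [pid for pid, level in buffered if level >= feedback]


queue = [("p1", 0), ("p2", 2), ("p3", 1), ("p4", 3)]
print(eligible_packets(queue, 0))   # ['p1', 'p2', 'p3', 'p4']
print(eligible_packets(queue, 2))   # ['p2', 'p4']
print(eligible_packets(queue, 4))   # []
```

The scheduler then selects among the eligible packets by whatever policy it already uses (arrival order, service priority, deadlines, and so on).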

In a real network, it takes some time until a Transmit Feedback $f_\ell$ sent by 
$R_\ell$ reaches $X_\ell$. Likewise, the effect of eligibility designations made at $X_\ell$, as the result of 
the Transmit Feedback received from $R_\ell$, does not reach $R_\ell$ until after a delay related to 
the propagation time of $\ell$. In order to focus on the essence of the present invention, and 
not be sidetracked by secondary issues, we assume that the propagation delays of all 
network links are zero and that the Transmit Feedback generated at a receiving node $R_\ell$ is 
instantaneously sent to and detected by the sending node $X_\ell$. Later, we will discuss 
necessary modifications in the protocol to accommodate real network scenarios where these 
conditions are not met. 

Next, we describe how the level of a packet arriving at a node is determined. 
Let packet p arrive at node $R_\ell$ over link $\ell$, and assume that $f_\ell$ is the most recent Transmit 
Feedback sent to $X_\ell$. It follows that the level of p prior to transmission from the 
previous node ($X_\ell$) must have been $f_\ell$ or larger. To guarantee freedom from deadlock in the 
network, it suffices to assign level $1 + f_\ell$ to packet p (as well as follow the buffer 
management and feedback rules described below). However, this simple approach can lead 
to the assignment of different levels, at a given node, to packets of the same session, which 
in turn can result in misordering of these packets when they are forwarded to the next 
downstream node. 

To avoid misordering of packets belonging to the same session, it is advantageous 
(but not essential) to adopt the following principle in the implementation of the present 
invention: at each node and at each point of time, all buffered packets that have a common 
destination should have the same level (so that all will be eligible/ineligible at the same 
times, and therefore selected for transmission in the correct order). This principle may be 
accomplished at each node by increasing the level of all packets with a common destination 
to the highest level among them (which also potentially increases their opportunities to be 
eligible for transmission). With this modification, it is more appropriate to view the level 
assignments at a node as being performed on a per-destination, rather than per-packet, 
basis. In accordance with this viewpoint, at a given node, let us denote the level associated 
with a destination d by $\lambda_d$. The selective backpressure protocol of the present invention is 
based on the following rules for the assignment and updating of these destination levels. 

Level Assignment Rules: 

1. Every node n maintains a list of all destinations d that it encounters, along with the 
associated level $\lambda_d$. We refer to this list as the Level Table of node n. Initially, there are 
no entries in the Level Table. 

2. By default, the level of any destination not included in the Level Table is zero. 
Accordingly, any destination which has a level equal to zero may be eliminated from the 
Level Table. 

3. At any point of time, the level $\lambda_p$ of each packet p which is buffered at node n is equal to 
the level $\lambda_d$ associated with the corresponding destination d, at that time. 

4. When a packet p with destination d arrives from another network node over some link 
$\ell$, the level associated with d is updated as 

$$\lambda_d \leftarrow \max\{\lambda_d,\ 1 + f_\ell\}, \qquad (2)$$

where $f_\ell$ is the value of the most recent Transmit Feedback sent over the reverse link 
$\ell'$. 

5. When a packet p with destination d enters the network at node n (over some network 
access link) the level $\lambda_d$ does not change. For instance, when a packet p arrives with 
destination d and that destination d is not included in the Level Table for node n, then 
the level $\lambda_d$ remains set to zero. 

6. When all buffered packets with destination d have left node n, it is permissible (but not 
necessary) to eliminate destination d from the Level Table at n. In other words, it is only 
necessary to keep values in the Level Table for destinations of currently-buffered 
packets. Note from rule 2 above that such elimination is equivalent to resetting $\lambda_d$ to 
zero. It also serves to automatically "refresh" entries in the Level Table, which is needed 
since the topology or routing paths may change over time. 
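
The Level Assignment Rules can be collected into a small data structure. The Python sketch below is one hypothetical way to do so; the class name, method names, and the per-destination packet counter used to decide when an entry may be dropped are assumptions for illustration only.

```python
class LevelTable:
    """Per-destination levels kept at one node (Level Assignment Rules 1-6)."""

    def __init__(self):
        self.levels = {}     # rule 1: the table starts with no entries
        self.buffered = {}   # destination -> number of packets now buffered

    def level(self, dest):
        # Rules 2 and 3: a missing entry means level zero, and every buffered
        # packet takes the level of its destination.
        return self.levels.get(dest, 0)

    def arrive_from_node(self, dest, feedback_sent):
        # Rule 4: new level is max(old level, 1 + most recent feedback sent).
        self.levels[dest] = max(self.level(dest), 1 + feedback_sent)
        self.buffered[dest] = self.buffered.get(dest, 0) + 1

    def arrive_from_access_link(self, dest):
        # Rule 5: a packet entering the network here leaves the level as is.
        self.buffered[dest] = self.buffered.get(dest, 0) + 1

    def depart(self, dest):
        self.buffered[dest] -= 1
        if self.buffered[dest] == 0:
            # Rule 6 (optional): drop the entry once nothing is buffered,
            # which is the same as resetting the level to zero.
            del self.buffered[dest]
            self.levels.pop(dest, None)


table = LevelTable()
table.arrive_from_access_link("d")
print(table.level("d"))          # 0
table.arrive_from_node("d", 2)
print(table.level("d"))          # 3 (applies to both buffered packets)
table.arrive_from_node("d", 0)
print(table.level("d"))          # 3 (max(3, 1) leaves it unchanged)
for _ in range(3):
    table.depart("d")
print(table.level("d"))          # 0 (entry removed)
```

Because the level is stored per destination, raising it for one arriving packet raises it for every buffered packet with that destination, which is what keeps packets of a session in order.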

Before proceeding, we present a few important observations regarding the above rules: 

• If all packets encountered by node n and destined for destination d enter the network at 
n, then $\lambda_d$ is always equal to zero since it is never subjected to the update in rule 4 
above. This means that packets arriving into the network at node n and destined for d 
will assume level zero at n, provided that n does not encounter any traffic destined for d 
that comes from another network node. 

• When a packet arrives at node n from another network node, it will be assigned a 
level of at least 1, since the level associated with its destination will undergo the update 
in rule 4 above. 

• Updating according to the above rules will never result in a level larger than D. By the 
time the level associated with a packet reaches D, the packet must have reached its 
destination. Typically, packets' levels are less than D when they reach their destinations. 

• Nodes do not need to keep track of the levels associated with each and every individual 
buffered packet. Each node only needs to maintain a short Level Table listing the 
destinations for which packets are buffered at the node. 

To illustrate how levels are assigned, Fig. 4 shows a network example of various 
levels and Transmit Feedback values. Three serially interconnected nodes A, B and C are 
shown, with the heavy line from left to right indicating the packet data path, and the lighter 
line from right to left indicating the feedback path. Nodes X, Y and Z are assumed to be 
destination nodes that are interconnected with the nodes A, B and C by yet other parts of 
the network that are not shown. In other words, we are assuming for illustration purposes 
that a routing process will have certain packets destined for nodes X, Y and Z pass through 
nodes A, B and C in order to get to their final destinations. 

As shown in Fig. 4, node A has permission to transmit a packet destined to node X 
to node B, because the level associated with X in node A's Level Table (i.e., "1") is not less 
than the Transmit Feedback parameter (i.e., "0") returned to node A via the reverse link from 
node B. When the same packet reaches and is buffered at node B, the level of the packet is 
increased to "4", which is the value associated with destination X in node B's Level Table. 
This is because of Rule 4 above. Similarly, node B has permission to transmit the packet 
destined to node X to node C, because the level associated with X (i.e., "4") is not less than 
the Transmit Feedback parameter (i.e., "1") received in node B via the reverse link from 
node C. When the packet reaches and is buffered at node C, the entry for destination X in 
node C's Level Table is initially "1". This value, however, is increased to "2" (which automatically 
increases the level to "2" of any other packets in the node C buffer destined for node X), 
again as a result of the application of Rule 4 above. Notice in this simple example that the 
level of the packet destined for node X went from "1" to "4" to "2" as it passed from nodes 
A to B to C. 
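
The Fig. 4 walk-through can be traced with a few lines of code. This is a sketch under two assumptions: eligibility means "level not less than the Transmit Feedback", and Rule 4 has the form max(table level, 1 + feedback). The function names are hypothetical; the numeric values are those quoted in the text.

```python
def eligible(level, feedback):
    # Transmit Eligibility Rule: the level must not be less than the feedback.
    return level >= feedback


def level_on_arrival(table_level, feedback):
    # Assumed form of Rule 4 at the receiving node.
    return max(table_level, 1 + feedback)


# Values read from the Fig. 4 description.
level_at_a = 1          # entry for X in node A's Level Table
feedback_b_to_a = 0     # Transmit Feedback sent by B to A
table_b = 4             # entry for X in node B's Level Table
feedback_c_to_b = 1     # Transmit Feedback sent by C to B
table_c = 1             # entry for X in node C's Level Table before arrival

assert eligible(level_at_a, feedback_b_to_a)
level_at_b = level_on_arrival(table_b, feedback_b_to_a)
assert eligible(level_at_b, feedback_c_to_b)
level_at_c = level_on_arrival(table_c, feedback_c_to_b)

trace = [level_at_a, level_at_b, level_at_c]
print(trace)  # [1, 4, 2]
```

The trace reproduces the "1" to "4" to "2" sequence: at node B the existing table entry dominates, while at node C the term 1 + feedback dominates.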

The example of Fig. 4 shows that a packet's level is quite "volatile" as it travels 
through a network; the level can even increase while it is buffered at a node (as at node C in 
Fig. 4). One of the few things that can be said about a packet's level is that it is less than or 
equal to D minus the number of hops remaining to the packet's destination. As an 
illustration, Figs. 5a and 5b show two instances of the same sample 10-node network, with 
specific (exemplary) routing paths to reach destinations D1 and D2, respectively. Inside 
each node is an integer that indicates the maximum possible level of packets destined to D1 
or D2. Consider Fig. 5a first. Note that every node 501-510 has some path to reach 
destination D1, which is node 510. At a leaf of the routing tree (e.g., nodes 501 and 504), 
the level is always zero. A packet entering the network at node 507, however, might be 
assigned a level as large as "3" because of the impact on node 507's Level Table from 
packets that entered the network at a previous node, such as node 501, and that may have 
attained a level equal to 3 at node 507. Likewise, the packets that enter at node 504 (with 
their levels initially set to "0") might have their levels increased to "3" when they reach node 
507. Similar observations can be made with the set of paths in Fig. 5b that route packets to 
node 501 (which is labeled as D2 in Fig. 5b). In this example, nodes 506, 509 and 510 are 
leaf nodes, and a packet going from node 506 to node 502 would have its level increased 
from "0" to "4" as it travels toward node 501. 

In summary, a packet's level increases and decreases as a function of the topology, 
the distance the packet travels, the "recent history" of arrivals (throughout the network), 
and the state of network congestion. 

Now that we have described the concepts of packet (or destination) levels, the 
Transmit Feedback, and the Transmit Eligibility Rule, we explain in the next few paragraphs 
how the values for the Transmit Feedback are selected. First, though, we need to briefly 
describe the general switch architecture under consideration. 

The selective backpressure technique of the present invention can be used in 
networks with arbitrary switch configurations. So, we consider a general switch model for 
each node of the network, as illustrated in Fig. 6. The switch includes a cross point matrix 
selectively interconnecting N input links 601-1 through 601-N with M output links 610-1 
through 610-M. A virtual input-output queue 620-1,1 through 620-N,M is associated with 
each input-output pair (i,j), a virtual Receiving Queue 630-1 through 630-N is associated 
with each incoming network link, and a virtual Sending Queue 640-1 through 640-M is 
associated with each outgoing network link. The Receiving Queue 630 associated with an 
incoming link ℓ is used for determining the Transmit Feedback f_ℓ, and the Sending Queue 
640 is used with its associated scheduling algorithm in determining the next packet to be 
transmitted, selecting from those packets that are eligible, including those generated at the 
node and stored in the buffers assigned for traffic entering the network at the node. Note 
here that Fig. 6 does not explicitly show the buffers assigned for the traffic entering (and 
also leaving) the network at the node (i.e., over some network access link). Note also that 
in the switch model illustrated in Fig. 6, each packet is at the same time treated as belonging 
to both a Receiving and a Sending Queue. We assume that the scheduling algorithm S_ℓ is 
well-behaved, meaning that eligible packets in each Receiving Queue (i.e., arriving from 
each of the input links) will eventually be selected for transmission over an output link. 
Otherwise, a livelock would result if the link scheduling algorithm S_ℓ continually refused to 
select a particular eligible packet for transmission over an output link. 

This general switch model of Fig. 6 can be physically implemented in many ways, 
which is why it is important to state that the defined input-output queues 620, the Receiving 
Queues 630, and Sending Queues 640 are virtual. For instance, in a completely-shared-memory 
switch, all Input-Output, Receiving, and Sending Queues are maintained in lists as 
packets arrive on various incoming links and depart on various outgoing links. In an 
input-buffered (output-buffered) switch, however, the Receiving (Sending) Queue could be a 
physical buffer and the other Queues would still be virtual entities that are maintained by, 
for example, keeping lists of packets destined for (coming from) various outgoing 
(incoming) links. 

Now consider again an arbitrary network link ℓ and the receiving node R_ℓ. We 
refer to the input buffer at R_ℓ that is associated with the incoming link ℓ as the Receiving 
Queue of ℓ. Let the size of this buffer be denoted by b_ℓ, and let y_max denote the maximum 
size of a packet in the network. Our core idea for deadlock prevention is to manage the 
Receiving Queue of each link ℓ and to set the value of the Transmit Feedback f_ℓ as 
described in the following paragraphs. 

As illustrated in Fig. 7, we divide a buffer 701 having a total buffer size b_ℓ into D 
parts b_i, i = 1, 2, ..., D, satisfying 

    b_i \geq y_{\max}, \quad i = 1, 2, \ldots, D,    (3)

    \sum_{i=1}^{D} b_i = b_\ell.    (4)

This division is not a physical partitioning of the buffer space; it is only an allocation of 
space. Note here that in reality, the size b_ℓ of buffer 701 is only weakly dependent on the 
maximum route length D. Most of the buffer space is in b_1 (i.e., typically b_j < b_1 for j > 1) 
and the partitioning is "virtual." 

We refer to b_i as the buffer budget of level i and require that a packet of level i be 
accepted into buffer 701 only if there is enough budget available for it at levels i or below. 
Let n_i, i = 1, 2, ..., D, denote the combined size of packets of level i that are stored in the 
Receiving Queue of link ℓ. The above requirement may be stated as 

    n_1 \leq b_1,    (5)
    n_1 + n_2 \leq b_1 + b_2,    (6)

or, more generally, 

    \sum_{j=1}^{i} n_j \leq \sum_{j=1}^{i} b_j, \quad i = 1, 2, \ldots, D.    (7)

Equivalently, we may define buffer threshold constraints B_i = \sum_{j=1}^{i} b_j that limit the 
maximum buffer capacity that can be occupied by packets of level less than or equal to i. At 
all times, all B_i buffer threshold constraints must be satisfied (as in Equation (7)). Note that 
B_D = b_ℓ, the size of the buffer. 
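
The threshold constraints can be checked directly from the per-level occupancies. The following is a minimal sketch, assuming levels are numbered 1 to D and sizes are in arbitrary consistent units; the function name and argument layout are hypothetical.

```python
from itertools import accumulate


def can_accept(n, b, level, size):
    """Return True if a packet of the given level and size fits.

    n[i-1] is the occupancy n_i and b[i-1] the budget b_i of level i.
    The packet fits if every prefix sum of occupancies stays within the
    corresponding threshold B_i (prefix sum of budgets).
    """
    trial = list(n)
    trial[level - 1] += size
    return all(used <= limit
               for used, limit in zip(accumulate(trial), accumulate(b)))


budgets = [6, 2, 2]                              # B = [6, 8, 10]
print(can_accept([0, 0, 0], budgets, 1, 7))      # False: exceeds B_1
print(can_accept([0, 0, 0], budgets, 3, 10))     # True: level 3 may use all budgets
print(can_accept([5, 0, 0], budgets, 2, 4))      # False: 5 + 4 exceeds B_2
```

The second call shows that a high-level packet may draw on the budgets of all lower levels, whereas a level-1 packet is confined to b_1.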

In order to observe the above requirements (to prevent deadlocks and livelocks), it 
is not necessary to physically partition the Receiving Queue into different segments. Instead, 
as illustrated in Fig. 8, which is another view of buffer 701 of Fig. 7, the above requirement 
may be implemented using a set of buffer management parameters m_i defined as 

    m_i = \sum_{j=1}^{i} (b_j - n_j), \quad i = 1, 2, \ldots, D.    (8)

Equivalently, m_i = B_i - \sum_{j=1}^{i} n_j. m_i refers to that part of the combined buffer budget of levels 
j ≤ i which is not allocated to packets of levels j ≤ i. This means that out of the combined 
buffer budget of levels j ≤ i, a budget m_i is either allocated to packets of levels j > i or not 
allocated to any packets at all. With this notation, m_D equals the total size of the vacant 
space in the buffer. Notice that since packets of level j can use the buffer budget of any level 
k ≤ j, the term b_j - n_j in (8) can be negative for some j. However, m_i cannot be negative for 
any i, since packets of levels j ≤ i cannot use the buffer budget of a level higher than i. 

It follows from (8) that when a new packet of length y and level j is stored in the 
buffer (leaves the buffer), m_i must be decreased (increased) by y for all i ≥ j. Similarly, 
when the packet's level is increased from j to k, then according to (8), m_i must be increased 
by y for all levels i such that k > i ≥ j. 
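
The incremental updates of the parameters m_i can be checked against their definition. The following is a sketch only, assuming 1-based levels and hypothetical class and method names; it keeps both the counters n_i and the parameters m_i so that the two can be compared.

```python
class BudgetCounters:
    """Tracks m_i incrementally and can recompute it from Equation (8)."""

    def __init__(self, budgets):
        self.b = list(budgets)
        self.n = [0] * len(self.b)
        # Empty buffer: m_i is the prefix sum of the budgets.
        self.m, total = [], 0
        for bj in self.b:
            total += bj
            self.m.append(total)

    def store(self, level, size):
        self.n[level - 1] += size
        for i in range(level, len(self.m) + 1):   # all i >= level
            self.m[i - 1] -= size

    def remove(self, level, size):
        self.n[level - 1] -= size
        for i in range(level, len(self.m) + 1):   # all i >= level
            self.m[i - 1] += size

    def raise_level(self, old, new, size):
        self.n[old - 1] -= size
        self.n[new - 1] += size
        for i in range(old, new):                 # old <= i < new
            self.m[i - 1] += size

    def m_from_definition(self):
        out, total = [], 0
        for bj, nj in zip(self.b, self.n):
            total += bj - nj
            out.append(total)
        return out


c = BudgetCounters([6, 2, 2])
c.store(2, 3)            # b_2 - n_2 is now -1, yet every m_i stays nonnegative
print(c.m)               # [6, 5, 7]
c.raise_level(2, 3, 3)
print(c.m)               # [6, 8, 7]
```

In this example the level-2 packet borrows from the level-1 budget, which is why the individual term b_2 - n_2 is negative while m_2 is not.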

The above results also provide the guideline for choosing the Transmit Feedback f_ℓ 
to be sent over the reverse link ℓ' to the upstream node. Since the parameters m_i should 
always be nonnegative, the buffer can store a new packet of level j and arbitrary length, 
provided that currently m_i ≥ y_max for all i ≥ j. On the other hand, when a packet arrives 
following the sending of the Transmit Feedback f_ℓ, the level assigned to it could be as low 
as 1 + f_ℓ. Since we would like to set the value of the Transmit Feedback f_ℓ as low as 
possible, we conclude that f_ℓ should be set to a level j such that m_i ≥ y_max for all i ≥ j+1, 
and m_j < y_max. It follows that f_ℓ should be set to the highest level j for which m_j < y_max. 
Accordingly, if m_i ≥ y_max for all i = 1, 2, ..., D, then we set f_ℓ = 0, allowing the transmission 
of packets of any level over link ℓ. Conversely, if m_D < y_max, then it follows that f_ℓ = D. 
As we said before, if there is any packet of level D in the upstream node, it must be destined 
for that node itself. Therefore, all packets waiting to be sent over link ℓ must have a level 
less than D. It follows that, with the Transmit Feedback set to D, no packet is eligible to be 
sent over ℓ. 

We now summarize the rules to be followed in accordance with the present 
invention for feedback setting of link ℓ and management of its Receiving Queue. 

Buffer Management Rules: 

• When the Receiving Queue of link ℓ is empty, initialize 

    m_i = \sum_{j=1}^{i} b_j, \quad i = 1, 2, \ldots, D.    (9)

• If a packet of length y arrives and is buffered at level j, decrease m_i by y, for i ≥ j. 

• If a packet of length y and level j leaves the buffer, increase m_i by y, for i ≥ j. 

• If the level of a packet of length y is increased from level j to level k > j, increase m_i by y, 
for all i such that k > i ≥ j. 

Transmit Feedback Rule: 

Set link ℓ's Transmit Feedback f_ℓ = j, where j is the largest level for which m_j < 
y_max. If no such level exists, set f_ℓ = 0. 
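
The Transmit Feedback Rule reduces to a scan from the highest level downward. The following is a minimal sketch with a hypothetical function name; m[i-1] holds m_i, and the sample values are illustrative only.

```python
def transmit_feedback(m, y_max):
    """Return the largest level j with m_j < y_max, or 0 if none exists."""
    for j in range(len(m), 0, -1):
        if m[j - 1] < y_max:
            return j
    return 0


y_max = 2
print(transmit_feedback([6, 8, 10], y_max))  # 0: packets of any level may be sent
print(transmit_feedback([1, 3, 5], y_max))   # 1: only levels 1 and above
print(transmit_feedback([0, 0, 1], y_max))   # 3: equals D, so no packet is eligible
```

Scanning from the top stops at the first level whose remaining budget cannot hold a maximum-size packet, which is exactly the "largest level" required by the rule.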

Finally, it is to be noted that, as the result of applying the selective backpressure 
technique in accordance with the present invention, by selectively designating packets as 
eligible for transmission, the present invention improves the normal operation of scheduler 
S_ℓ of each link ℓ. This improvement is obtained because, through the use of our eligibility 
control mechanism, the order in which packets waiting for transmission are in fact 
transmitted is changed from the order that would otherwise occur through the operation of 
scheduler S_ℓ. This is in contrast to the performance of a plain backpressure mechanism, 
which does not change the order of packet transmissions over a link, since either all packets 
waiting at a link are eligible for transmission or no packet may be sent at all. 

Fig. 9 is a flow chart illustrating the send functions performed at a sending node 
X_ℓ. In step 901, a determination is made as to which packets are eligible for transmission 
over link ℓ, by determining if the priority level for a packet is greater than or equal to the 
feedback level, or λ_p ≥ f_ℓ, as in Equation (1). Any arbitrary scheduling algorithm S_ℓ is then 
used to select the next packet for transmission from among those that are eligible, in step 
903, whereupon the process of Fig. 9 returns and repeats step 901. 
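
The send process of Fig. 9 can be sketched as an eligibility filter followed by a scheduler. The sketch below assumes a first-in-first-out scheduler purely for illustration (the text allows any scheduling algorithm); the function name and the packet representation are hypothetical.

```python
def select_next(queue, feedback, level_of):
    """Remove and return the first eligible packet, or None if none is eligible.

    queue    : list of packets (dicts with a "dest" key), oldest first
    feedback : Transmit Feedback most recently received for this link
    level_of : the node's Level Table (destination -> level)
    """
    for idx, pkt in enumerate(queue):
        if level_of[pkt["dest"]] >= feedback:   # step 901: eligibility
            del queue[idx]                      # step 903: FIFO among eligible
            return pkt
    return None


queue = [{"id": 1, "dest": "X"}, {"id": 2, "dest": "Y"}]
levels = {"X": 0, "Y": 3}
print(select_next(queue, 2, levels)["id"])   # 2: the later packet overtakes
print(select_next(queue, 2, levels))         # None: packet 1 is not eligible
print(select_next(queue, 0, levels)["id"])   # 1: eligible once feedback drops
```

The first call illustrates the reordering effect described above: with a feedback of 2, a packet of level 3 is sent ahead of an older packet of level 0.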

Fig. 10 is a flow chart illustrating the receive functions performed at a receiving 
node R_ℓ. In step 1001, the buffer threshold constraints B_i are initialized. Then, in step 
1003, the priority level λ_d of packets destined for destination d is updated. Next, in step 
1005, the counters n_i that track the total buffer space occupied by packets with the priority 
levels i are updated. Note here that at most two counters n_i can possibly change: if the 
incoming packet results in the priority level λ_d being raised to 1 + f_ℓ, then the counters 
associated with both the previous (prior to arrival of the incoming packet) and new priority 
levels are changed; alternatively, if 1 + f_ℓ is less than or equal to the previous priority level 
λ_d, then only the counter n_i associated with that priority level is changed. 

In step 1007, a new feedback signal f_ℓ is sent over reverse link ℓ', if necessary, i.e., 
if the status of any buffer has changed such that the feedback signal has changed, in 
accordance with the requirement that the feedback f_ℓ sent from the receiving node R_ℓ to 
the sending node X_ℓ represents the lowest priority level of packets that the receiving node 
could accept without violating any of the B_i buffer threshold constraints. In other 
words, the feedback value must be set such that the receiving node R_ℓ will have room to 
accept packets of priority level (1 + f_ℓ) or greater, without violating any of the buffer 
threshold constraints. The process then returns to step 1003, where the process is repeated 
when subsequent packets are received. 

As indicated previously, one major advantage of the present invention is that the 
sending of the Transmit Feedback signals to the upstream neighbor is easily implemented in 
currently available IEEE 802.3 gigabit Ethernet, using the standard PAUSE frames. The 
invention can also be used in other networks that use different methods to signal congestion 
status to upstream nodes. Rather than coding the PAUSE frame's parameter to represent 
the period of time that the upstream neighbor should not send data frames, the parameter is 
coded to represent the various Transmit Feedback values. 
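
As an illustration of this re-coding, the sketch below builds a PAUSE frame whose 16-bit parameter field carries a Transmit Feedback value instead of a pause time. The destination address, EtherType and opcode are the standard PAUSE values; the direct mapping of the feedback integer into the parameter field, the function names, and the example source address are assumptions made for illustration, not a coding prescribed by the text.

```python
import struct

PAUSE_DST = bytes.fromhex("0180c2000001")   # reserved multicast address for PAUSE
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001


def encode_feedback_frame(src_mac: bytes, feedback: int) -> bytes:
    """Build a PAUSE frame (without FCS) carrying `feedback` in the parameter field."""
    if len(src_mac) != 6 or not 0 <= feedback <= 0xFFFF:
        raise ValueError("bad source address or feedback value")
    header = PAUSE_DST + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    payload = struct.pack("!HH", PAUSE_OPCODE, feedback)
    return header + payload.ljust(46, b"\x00")   # pad to minimum payload size


def decode_feedback(frame: bytes) -> int:
    """Extract the feedback value from a frame built by encode_feedback_frame."""
    ethertype, opcode, value = struct.unpack("!HHH", frame[12:18])
    if ethertype != MAC_CONTROL_ETHERTYPE or opcode != PAUSE_OPCODE:
        raise ValueError("not a PAUSE frame")
    return value


frame = encode_feedback_frame(bytes.fromhex("020000000001"), 3)
print(len(frame), decode_feedback(frame))   # 60 3
```

Because the parameter field is 16 bits wide, any maximum route length D of practical interest fits without changing the frame format.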

Various alternatives and extensions of the present invention are possible. For 
example, up to this point, we have ignored the effects of propagation delays. The basic 
modification needed to incorporate propagation delays into the present invention is to make 
sure that nodes transmit their feedback signals with enough "lead time" so that the controls 
"take effect" at the appropriate moments (assuming worst-case conditions). For example, if 
a node determines that for a given incoming link a particular Transmit Feedback signal 
needs to take effect at time t_0, then it will need to transmit the signal at time t_0 - T, where T 
is the round-trip propagation time of that link. As the buffer occupancy continues to 
change, note that a node might need to transmit an updated Transmit Feedback between the 
time t_0 - T and the time t_0. 

Specifically, very little needs to change to incorporate the effects of a round-trip 
delay T in the present invention. The Transmit Eligibility Rule and the Buffer Management 
Rules are unchanged. Only the Transmit Feedback Rule and the Level Assignment Rules 
need to be slightly modified. First, in determining what Transmit Feedback to send at time 
t_0, the receiving node R_ℓ needs to consider the "worst-case" (maximum) occupancy of its 
Receiving Queues at time t_0 + T in the future. If the link rate from X_ℓ to R_ℓ is r, then the 
occupancy of a Receiving Queue may increase by at most rT. Consequently, the modified 
rule is: 

Transmit Feedback Rule: 

Set link ℓ's Transmit Feedback f_ℓ = j, where j is the largest level for which m_j - rT 
< y_max. If no such level exists, set f_ℓ = 0. 
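
The modified rule differs from the basic one only by the worst-case margin rT. The following sketch uses integer units (bytes, and bytes per microsecond) to keep the arithmetic exact; the function name, the units, and the sample values are assumptions for illustration.

```python
def transmit_feedback_with_delay(m, y_max, rate_bytes_per_us, rtt_us):
    """Largest level j with m_j - r*T < y_max, or 0 if no such level exists."""
    margin = rate_bytes_per_us * rtt_us      # worst-case arrivals during T
    for j in range(len(m), 0, -1):
        if m[j - 1] - margin < y_max:
            return j
    return 0


m = [3000, 4200, 5000]    # remaining budgets in bytes (illustrative values)
y_max = 1500              # maximum packet size in bytes (illustrative value)
rate = 125                # 1 Gb/s expressed as bytes per microsecond

print(transmit_feedback_with_delay(m, y_max, rate, 0))    # 0: same as the basic rule
print(transmit_feedback_with_delay(m, y_max, rate, 10))   # 0: margin of 1250 bytes
print(transmit_feedback_with_delay(m, y_max, rate, 20))   # 1: margin of 2500 bytes
```

With T = 0 the function reduces to the original Transmit Feedback Rule; a longer round trip makes the feedback more conservative for the same buffer state.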

If desired, more complex "book-keeping" algorithms can also be incorporated to 
make more efficient use of memory and bandwidth resources. Using techniques similar to 
those proposed in "A New Feedback Congestion Control Policy for Long Propagation 
Delays", by I. Iliadis, IEEE J. Select. Areas Commun., Vol. 13, pp. 1284-1295, September 
1995, information about the recent history (from time t_0 - T to time t_0) of feedback 
signals that were transmitted on a link to an upstream neighbor can be used to compute an 
upper bound (perhaps less than rT) on the amount of information that will reach the 
downstream node between time t_0 and time t_0 + T. 

Second, the (modified) Level Assignment Rule is based on the current buffer 
occupancy (just as in the T = 0 special case considered previously): 

Level Assignment Rule: 

When a packet p with destination d arrives from another network node over some 
link ℓ, the level associated with d is updated as 

    \lambda_d \leftarrow \max(\lambda_d, 1 + j),    (10)

where j is the largest level for which currently m_j < y_max (if no such level exists, set j = 0). 
Note that because we make worst-case assumptions in the (modified) Transmit Feedback 
Rule, we can assign levels to packets at time t_0 without needing to remember what Transmit 
Feedback value was sent at time t_0 - T. Note that the Transmit Feedback value f_ℓ at time t_0 
is greater than or equal to the j value in the Level Assignment Rule at time t_0 + T. 
This greatly simplifies the implementation. 
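
A sketch of the modified Level Assignment Rule follows. It computes j from the current parameters m_i at the moment of arrival rather than from a remembered feedback value; the function name and the dictionary-based Level Table are hypothetical.

```python
def assign_level(levels, dest, m, y_max):
    """Apply Equation (10) for a packet to `dest` arriving from another node.

    levels : Level Table (destination -> level), updated in place
    m      : current buffer management parameters, m[i-1] = m_i
    """
    j = 0
    for i in range(len(m), 0, -1):      # largest level with m_i < y_max
        if m[i - 1] < y_max:
            j = i
            break
    levels[dest] = max(levels.get(dest, 0), 1 + j)
    return levels[dest]


levels = {}
print(assign_level(levels, "d", [6, 1, 3], 2))    # 3: j = 2 at arrival time
print(assign_level(levels, "d", [6, 8, 10], 2))   # 3: the level is never lowered
```

No per-link history of transmitted feedback values is consulted, which is the simplification noted above.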

What happens if the receiving node R_ℓ has an incorrect estimate of the round-trip 
propagation delay T? If the actual delay is greater than the estimate, then some packet loss 
may occur and deadlocks are possible. On the other hand, if the actual propagation delay is 
less than the estimated value used by node R_ℓ, then the only penalty is a (slight) loss in 
efficiency and perhaps some wasted buffer. Consequently, in local and metropolitan area 
network applications where link propagation delays are small, it might be preferable to 
select some upper bound on link propagation delays of the network as the parameter T used 
on all links, just as there are maximum distances for links due to physical constraints. If the 
protocol is used in wide area networks, however, it is probably better to individually set the 
T parameter for each link. Finally, to guarantee no packet loss in the presence of nonzero 
propagation delays, the buffer size must be increased such that b_1 ≥ y_max + rT. Equation 
(3), constraining b_i, still holds for i = 2, 3, ..., D. Another option in dealing with 
long-propagation environments is to reinterpret some of the Transmit Feedback values (most 
likely the higher values) as permission to transmit up to a specified maximum number of 
packets of specified levels, rather than an unlimited number of packets. Additional Transmit 
Feedback messages would need to be sent to grant permission to transmit additional 
packets. 
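
The buffer sizing condition can be made concrete with a small calculation. The sketch below returns the smallest budgets that satisfy b_1 ≥ y_max + rT and Equation (3); the function name and the sample figures (D = 8, a 1518-byte maximum frame, a 1 Gb/s link and a 10 microsecond round trip) are assumptions chosen for illustration.

```python
def min_buffer_budgets(D, y_max, rate_bps, rtt_us):
    """Smallest budgets b_1..b_D, in bytes, meeting the sizing conditions."""
    # r*T in bytes, rounded up: bits/s * microseconds / (8 * 10**6)
    r_t = -((-rate_bps * rtt_us) // 8_000_000)
    return [y_max + r_t] + [y_max] * (D - 1)


budgets = min_buffer_budgets(D=8, y_max=1518, rate_bps=10**9, rtt_us=10)
print(budgets[0])     # 2768 bytes for b_1 (1518 + 1250)
print(sum(budgets))   # 13394 bytes in total
```

Under these assumed figures the propagation-delay allowance adds only 1250 bytes to the first budget, which is consistent with the remark that a single network-wide bound on T is reasonable for local area links.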

Before briefly presenting a few other possible extensions of the present invention, it 
is important at this point to mention the relationship of the invention with end-to-end 
congestion control techniques. While the present invention addresses the problems of 
short-term congestion overflows and deadlocks, it does not address the issue of fairness in 
providing service to different users. One way of resolving this shortcoming is to couple this 
technique with end-to-end congestion control schemes that handle congestion problems on 
a quasi-static basis while providing the desired fairness and/or priorities in the amount of 
services given to different users in the long run. This can be accomplished in a number of 
ways that will be apparent to persons skilled in the art. (See, for example, "A Class of 
End-to-End Congestion Control Algorithms for the Internet" by S. J. Golestani et al., Proc. 6th 
International Conf. on Network Protocols, October 1998.) 

Finally, we mention a few more possible extensions of the present invention. First, it 
is sometimes advantageous to incorporate some packet dropping to address other network 
issues, such as aging of packets due to errors, and blocking caused when certain network 
links are "overloaded". (See "A Simple Technique that Prevents Packet Loss and 
Deadlocks in Gigabit Ethernet", by M. Karol et al., Proc. 1999 International Symposium on 
Communications (ISCOM'99), pp. 26-30, November 1999.) Second, if it is desired to 
internetwork the proposed lossless network with a network (e.g., TCP) that depends on 
losses to rate control its sources, then it is simple to design a "gateway" between the two 
networks. For instance, an edge device near a TCP source, or near the WAN, could be used 
to convert the lossless LAN's end-to-end technique to the TCP "loss technique" (i.e., by 
dropping packets) to implicitly rate control the TCP sources. Third, the present invention 
can be modified such that some Classes of Service (CoS) will be allowed to "ignore" the 
Transmit Feedback congestion control signals that gradually restrict the set of packets 
eligible for transmission. This is possible if, perhaps, there are dedicated buffers set aside 
for them. Alternatively, all nodes might be programmed to always treat particular CoS 
packets as being of no less than a certain minimum level. 



